Abstract Body. 


Background / Context: 

Single-case designs are a class of research designs for evaluating intervention effects on 
individual cases. The designs are widely applied in certain fields, including special education, 
school psychology, clinical psychology, social work, and applied behavior analysis. The multiple 
baseline design (MBD) is the most frequently used single-case design (Shadish & Sullivan, 
2011; Smith, 2012), and is often described as having desirable internal validity characteristics. 
The MBD involves periodically measuring an outcome variable on several cases in the absence 
of intervention, then introducing the intervention to each case in turn while continuing to 
measure the outcome. Two key features of the design are thought to bear on its internal validity. The 
first feature is that the intervention is introduced deliberately by the researcher, rather than as a 
result of naturally occurring events (Horner et al., 2005; Kratochwill et al., 2012). In practice, 
researchers seldom use formal randomization of intervention starting points, though it is 
recognized that doing so strengthens internal validity (Kratochwill & Levin, 2010). The second 
key feature is that each case begins treatment at a different time, thus making it easier to rule out 
history-related threats (Kazdin, 2011; Kratochwill & Levin, 2010). 

In many of the disciplines that use MBDs, visual inspection of graphed outcome data is 
the primary method of analysis (Gast & Spriggs, 2010; Kazdin, 2011; Smith, 2012). Systematic 
procedures for creation of graphic displays and for visual analysis have been proposed (e.g., 
Horner, Swaminathan, Sugai, & Smolkowski, 2012) and incorporated in the What Works 
Clearinghouse procedures for reviewing single-case studies (Kratochwill et al., 2012). However, 
recent years have also seen renewed interest in statistical analysis of MBDs, particularly as a 
means for estimating summary effect sizes. Much of the statistical work has focused on the use 
of piece-wise regression models, either applied separately to each case (e.g., Center, Skiba, & 
Casey, 1985; Crosbie, 1993; Maggin et al., 2011) or formulated as a hierarchical linear model 
(e.g., Hedges, Pustejovsky, & Shadish, 2012, 2013; Van den Noortgate & Onghena, 2003, 2008). 

Purpose / Objective / Research Question / Focus of Study: 

This paper examines how the use of certain hierarchical linear models that have been 
proposed for the MBD affects internal validity, focusing specifically on the two key features 
described above. First, I consider a deliberate but non-random method of treatment assignment. I 
demonstrate that the treatment effect estimate from a conventional multi-level model can be 
biased under this mechanism, then provide an expression for the magnitude of the bias. Second, I 
argue that the main analytic models proposed for the MBD fail to capture the benefits of 
staggered treatment introduction. I propose an alternative model that does account for this key 
feature and, based on the model, I define an index measuring the strength of control offered by 
the MBD. 

Significance / Novelty of study: 

It is widely recognized that parameter estimates from hierarchical linear models can be 
biased if the covariates are correlated with the higher-level error terms (e.g., Wooldridge, 2002). 
One contribution of the present investigation is to describe a specific model that produces such 
correlation and to characterize the size of the resulting bias. The other contribution is to 
introduce an analytic model that more closely captures the key features of the MBD. While 
similar models have been used in other disciplines, I believe that this one warrants greater 
attention for analysis of single-case studies. 

Statistical, Measurement, or Econometric Model: 

The data from an MBD with $m$ cases and $N$ measurement occasions can be described as 
follows. Let $Y_{ij}$ denote the outcome measurement from the $j$th occasion for case $i$. Case $i$ is in the 
baseline phase for the first $B_i$ outcome measurements, after which it begins the treatment phase 
for the remaining $N - B_i$ measurements. Let $X_{ij}$ indicate the treatment status of case $i$ on occasion 
$j$, so $X_{ij} = 0$ for $j \le B_i$ and $X_{ij} = 1$ for $j > B_i$. 
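
As a minimal illustration of this data layout, the following Python sketch builds the matrix of treatment indicators from a vector of baseline lengths. The function name `treatment_indicators` and the example design (three cases, eight occasions) are hypothetical and chosen only for illustration.

```python
import numpy as np

def treatment_indicators(baseline_lengths, n_occasions):
    """Return an m x N array with X[i, j-1] = 1 when occasion j > B_i, else 0."""
    B = np.asarray(baseline_lengths)
    occasions = np.arange(1, n_occasions + 1)  # occasions numbered 1..N
    # Broadcasting compares every occasion against every case's baseline length.
    return (occasions[None, :] > B[:, None]).astype(int)

# Hypothetical design: m = 3 cases, N = 8 occasions, staggered start times.
X = treatment_indicators([2, 4, 6], 8)
print(X)
```

Each row corresponds to one case, and the position of the first 1 in a row marks the start of that case's treatment phase.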

A simple but commonly considered model for these data (see, e.g., Hedges et al., 2013) 
assumes that cases vary in their baseline levels and that the treatment leads to a shift in the mean 
level of the outcome that is constant across cases: 

$$Y_{ij} = \beta + \delta X_{ij} + \nu_i + \epsilon_{ij}, \tag{1}$$

where the within-case errors $\epsilon_{i1},\dots,\epsilon_{iN}$ have mean 0 and variance $\sigma^2$, and are assumed for the sake of 
simplicity to be independent. The terms $\nu_1,\dots,\nu_m$ represent case-specific variation in the mean 
level of the outcome during the baseline phase, and are assumed to be independent and normally 
distributed with mean 0 and variance $\tau^2$. Note that the covariate is not case-centered so as to 
preserve the interpretation of $\tau^2$ as between-case variation in baseline outcomes. 
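
For reference, the marginal covariance structure implied by Model (1) can be written out as a short worked step. This is a sketch that treats the treatment times as fixed and relies only on the stated assumptions (independent within-case errors, independent case effects):

$$
\text{Var}(Y_{ij}) = \tau^2 + \sigma^2, \qquad
\text{Cov}(Y_{ij}, Y_{ik}) = \tau^2 \;\; (j \neq k), \qquad
\text{Cov}(Y_{ij}, Y_{hk}) = 0 \;\; (i \neq h),
$$

so that any two measurements on the same case have correlation $\rho = \tau^2 / (\tau^2 + \sigma^2)$, the intra-class correlation.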

Selection mechanisms. In the hierarchical modeling framework, the parameters of 
Model (1) would typically be estimated using restricted maximum likelihood (for $\sigma^2$ and $\tau^2$) and 
weighted least squares (WLS) for $\beta$ and the treatment effect $\delta$. However, the WLS estimator of $\delta$ 
may be biased if treatment assignment times are selected using a deliberate but non-random 
method. One plausible method would be to “triage” cases according to the severity of their 
baseline outcome levels. Suppose that the experimenter plans to introduce the treatment at a 
fixed set of times where $1 < b_1 < b_2 < \cdots < b_m < N$. Suppose further that the experimenter 
has knowledge of $\nu_1,\dots,\nu_m$ based on experience or outside diagnostic information. The cases are 
assigned to treatment in order of their baseline severity: for a treatment intended to raise the level 
of the outcome, the case with the lowest expected baseline level enters the treatment phase first 
and the case with the highest expected baseline level enters the treatment phase last. It follows 
that $B_i = b_{r_i}$, where $(r_1,\dots,r_m)$ are the index ranks of $\nu_1,\dots,\nu_m$. This selection mechanism induces 
dependence between the covariate and the case-specific errors because $(X_{i1},\dots,X_{iN})$ is a 
function of $B_i$, which depends on $\nu_i$. I term this mechanism “triage on known baseline ranks,” in 
order to emphasize that selection is based on the known ranks of $\nu_1,\dots,\nu_m$. 
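
The triage mechanism can be explored with a small Monte Carlo sketch. The code below is illustrative only and makes several assumptions that are not taken from the analysis itself: the variance components are treated as known, the WLS estimator is computed as random-intercept generalized least squares via the standard quasi-demeaning transformation for balanced data, start times are equally spaced with $b_i = i \times n$ and $N = n \times (m + 1)$, and a within-case estimator (case intercepts treated as fixed) is included for comparison. All function names and parameter values are hypothetical.

```python
import numpy as np

def treatment_matrix(B, N):
    """X[i, j-1] = 1 when occasion j > B[i], else 0 (occasions numbered 1..N)."""
    occasions = np.arange(1, N + 1)
    return (occasions[None, :] > np.asarray(B)[:, None]).astype(float)

def random_intercept_gls(Y, X, sigma2, tau2):
    """GLS estimate of delta for a random-intercept model with known variances."""
    N = Y.shape[1]
    theta = 1.0 - np.sqrt(sigma2 / (sigma2 + N * tau2))
    # Quasi-demeaning: subtract a fraction theta of each case mean, then use OLS.
    y_star = (Y - theta * Y.mean(axis=1, keepdims=True)).ravel()
    x_star = (X - theta * X.mean(axis=1, keepdims=True)).ravel()
    design = np.column_stack([np.full(y_star.size, 1.0 - theta), x_star])
    coef, *_ = np.linalg.lstsq(design, y_star, rcond=None)
    return coef[1]

def within_case(Y, X):
    """Estimate of delta after centering outcome and covariate within each case."""
    y_c = Y - Y.mean(axis=1, keepdims=True)
    x_c = X - X.mean(axis=1, keepdims=True)
    return (x_c * y_c).sum() / (x_c ** 2).sum()

def simulate_bias(m=4, n=2, delta=1.0, beta=0.0, sigma=1.0, tau=1.0,
                  reps=4000, seed=0):
    """Average estimation error of both estimators under triage on known ranks."""
    rng = np.random.default_rng(seed)
    N = n * (m + 1)
    b = n * np.arange(1, m + 1)          # fixed, equally spaced start times
    gls, within = [], []
    for _ in range(reps):
        nu = rng.normal(0.0, tau, size=m)
        ranks = nu.argsort().argsort()   # 0 = lowest baseline level
        B = b[ranks]                     # lowest baseline level starts first
        X = treatment_matrix(B, N)
        eps = rng.normal(0.0, sigma, size=(m, N))
        Y = beta + delta * X + nu[:, None] + eps
        gls.append(random_intercept_gls(Y, X, sigma ** 2, tau ** 2))
        within.append(within_case(Y, X))
    return np.mean(gls) - delta, np.mean(within) - delta

bias_gls, bias_within = simulate_bias()
print(f"random-intercept GLS bias: {bias_gls:.3f}")
print(f"within-case estimator bias: {bias_within:.3f}")
```

Under these settings the random-intercept estimate should fall below the true effect on average, because cases with low baseline levels contribute more treated observations, while the within-case estimate should be close to unbiased. The sketch reports raw rather than standardized bias.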


Denote the WLS estimator based on Model (1) as $\hat{\delta}_{(1)}$. Given estimates of the variance 
parameters $\sigma^2$ and $\tau^2$, it can be shown that the standardized bias of $\hat{\delta}_{(1)}$ under triage on known 
baseline ranks is 

where $\mu_{(i)}$ are the mean order statistics from $m$ independent, unit-normal random 
variates, $\bar{b} = \frac{1}{m}\sum_{i=1}^{m} b_i$, $w = \frac{1}{m}\sum_{i=1}^{m} b_i (N - b_i)$, and $\rho = \frac{\tau^2}{\tau^2 + \sigma^2}$. Figure 1 plots the magnitude of the 

bias as a function of the intra-class correlation $\rho$, for varying values of $m$ and $N$. The bias is 
always negative, is never larger in magnitude than 0.1, and is largest when the between-case 
variation is small relative to the within-case variation. 

Triage on known baseline ranks might be considered unrealistic because it assumes 
knowledge of the true case-level effects (or at minimum, their ranks). A more realistic 
mechanism would involve assigning treatment times to cases based on the baseline outcomes 
actually observed as the experiment progresses. Such a selection mechanism induces dependence 
between the covariate and both the case-level errors and the within-case errors, making it 
difficult to derive analytic expressions for the magnitude of the resulting bias. In on-going work, 
I am using simulation methods to examine the bias under this and other related mechanisms. 

Model for staggered treatment introduction. No aspect of Model (1) accounts for the 
other key feature of the MBD — that treatment introduction is staggered rather than simultaneous. 
Instead, the model is structurally identical to a model for a collection of replicated AB designs, 
which are considered to have lower internal validity than the MBD (Kratochwill et al., 2012). 
The same remark holds true for a variety of other models proposed for use with the MBD, 
including piece-wise regression models (e.g., Center et al., 1985; Crosbie, 1993; Maggin et al., 
2011) and other hierarchical linear models (e.g., Ferron, Bell, Hess, Rendina-Gobioff, & 
Hibbard, 2009; Shadish, Kyse, & Rindskopf, 2013; Van den Noortgate & Onghena, 2003, 2008). 

Heuristically, staggering the introduction of treatment allows the analyst to control for 
history threats that are common across cases (Shadish, Cook, & Campbell, 2002, Chapter 6). 
Such threats are sometimes called “common shocks,” meaning influences that impact the 
outcome equally for all cases. To capture these common shocks, Model (1) can be modified to 
include fixed effects for each measurement occasion: 


$$Y_{ij} = \beta_j + \delta X_{ij} + \nu_i + \epsilon_{ij}. \tag{3}$$


This analytic model is commonly applied in the econometric literature on panel data (e.g., 
Wooldridge, 2002) and sometimes in the public health literature on “stepped wedge” trials (e.g., 
Hussey & Hughes, 2007); however, it has been entirely overlooked for analysis of MBDs. 

Three inter-related problems arise due to the focus on statistical models that do not 
capture this essential feature of the MBD. First, the naive statistical analysis will negate whatever 
improvements in internal validity the feature provides. To illustrate, observe that the 
treatment effect estimator based on Model (1) will be biased in the presence of common shocks 
(even when m is large). Assuming that treatment times are determined independently of either 
the known or observed baseline levels, the exact bias is given by 



where $b_0 = 0$ and $b_{m+1} = N$. Clearly, there is no reason to expect the bias to be zero unless each 
of the occasion-specific common shocks $\beta_1,\dots,\beta_N$ is zero. 
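
The effect of ignoring common shocks can be illustrated with a deterministic toy example. The sketch below is not the estimator used in the analysis: for simplicity it treats case intercepts as fixed dummy variables in both fits, generates noise-free data, and uses a hypothetical linear drift as the common shock. It compares a fit with case intercepts only against a fit that also includes occasion-specific effects, as in Model (3). The function name `fit_delta` and all numerical values are assumptions made for illustration.

```python
import numpy as np

def fit_delta(Y, X, occasion_effects):
    """OLS estimate of delta with case dummies and, optionally, occasion dummies."""
    m, N = Y.shape
    # Rows are ordered case by case, matching Y.ravel() and X.ravel().
    case_dummies = np.kron(np.eye(m), np.ones((N, 1)))
    columns = [case_dummies]
    if occasion_effects:
        # One dummy per occasion; the first is dropped to avoid collinearity.
        columns.append(np.kron(np.ones((m, 1)), np.eye(N))[:, 1:])
    columns.append(X.reshape(-1, 1))
    design = np.hstack(columns)
    coef, *_ = np.linalg.lstsq(design, Y.ravel(), rcond=None)
    return coef[-1]  # coefficient on the treatment indicator

# Hypothetical design: m = 3 cases, N = 8 occasions, start times 2, 4, 6.
m, N, delta = 3, 8, 1.0
B = np.array([2, 4, 6])
occasions = np.arange(1, N + 1)
X = (occasions[None, :] > B[:, None]).astype(float)

shocks = 0.2 * occasions            # common shock: steady upward drift
nu = np.array([-1.0, 0.0, 1.0])     # case-specific baseline levels
Y = shocks[None, :] + delta * X + nu[:, None]   # noise-free outcomes

delta_no_shocks = fit_delta(Y, X, occasion_effects=False)
delta_with_shocks = fit_delta(Y, X, occasion_effects=True)
print(delta_no_shocks, delta_with_shocks)
```

In this toy setting the fit without occasion effects attributes part of the drift to the treatment, whereas the fit with occasion effects recovers the true value of $\delta$. The occasion effects are separable from the treatment effect only because the start times are staggered; with simultaneous start times the treatment indicator would be collinear with the occasion dummies.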

Second, the naive statistical analysis is incongruent with best practices for visual analysis 
of the MBD, where a “vertical analysis” is used to assess whether the introduction of treatment 
for one case coincides with changes in the pattern of outcomes for other cases (Horner et al., 
2012). Including occasion-specific common shocks is the statistical analogue of the vertical 
analysis technique. 

Finally, Model (1) does not allow the analyst to quantify the extent to which cases are 
subject to common influences, even though this bears directly on internal validity. Intuitively, a 
design in which six cases are sampled from six separate schools does not provide the same 
strength of control as a design in which six cases are sampled from the same school. In the 
former case, the only influences controlled are those that are common across schools; in the latter 
case, influences that are unique to the school may also be ruled out. Under Model (3), an index 
for quantifying the magnitude of common shocks can be defined by comparing the variation 
between estimated common shocks to the variation in the outcome (both within and across cases) 
on a given measurement occasion. Let $\hat{\beta}_1,\dots,\hat{\beta}_N$ be the WLS estimates of the common shocks, 
with estimated covariance between shocks $j$ and $k$ denoted $v_{jk} = \widehat{\text{Cov}}(\hat{\beta}_j, \hat{\beta}_k)$. Define the index 

where $\bar{\beta} = \frac{1}{N}\sum_{j=1}^{N} \hat{\beta}_j$. The Qshock index is non-negative, with a value of zero indicating no 
common shocks and larger values corresponding to greater control. 


Usefulness / Applicability of Method: 

I estimated Models (1) and (3) on each of four outcomes from an MBD reported by Laski, 
Charlop, and Schreibman (1988). Table 1 reports standardized treatment effect estimates from 
each model along with the value of the Qshock index and the p-value from a test of the hypothesis 
that $\beta_1 = \beta_2 = \cdots = \beta_N$. For only one outcome measure (parent verbalization in free-play 
settings) is there strong statistical evidence of period-specific common shocks. For this 
outcome, controlling for common shocks substantially reduces the treatment effect estimate, 
from 1.57 s.d. to 0.42 s.d. The lack of evidence for common shocks in the other outcomes may 
suggest that the staggered treatment introduction in this design does not offer much additional 
control, or it may merely indicate low power to detect such shocks. 


Conclusions: 

Under triage on known baseline ranks, the bias of the treatment effect estimator can be 
removed either by treating the case-level intercepts as fixed effects or by group-centering the 
treatment indicator. Future work on statistical models for single-case designs will need to attend 
carefully to the possibility of bias induced by correlation between treatment assignment times 
and case-level characteristics. Further, it would be beneficial if applied single-case researchers 
described their treatment assignment procedures in greater detail; as the above example shows, 
“deliberate” assignment does not necessarily mitigate selection threats to internal validity. 

I have offered several arguments regarding the benefits of analytic models that control for 
common history threats. Compared to models without common shocks, these models have 
distinctly different implications for statistical power, which need to be explored further. While I 
have focused on perhaps the simplest possible specification, models that include more complex 
features (such as case-specific time trends) are also possible and warrant further study. Finally, 
further statistical and empirical work is needed in order to understand the properties of the Qshock 
index and the range of values likely to be observed in practice. 

Appendix B. Tables and Figures 

Figure 1. Standardized bias of weighted least squares estimator under the triage mechanism, for 
varying values of the intra-class correlation ($\rho$), phase length ($n$), and number of cases ($m$). It is 
assumed that treatment assignment times are equally spaced, so that $b_i = i \times n$ and the total 
number of measurement occasions is $N = n \times (m + 1)$. 

Table 1. Standardized treatment effect estimates, Qshock index values, and p-values from test 
for no period-specific shocks, as applied to outcome measures from MBD by Laski, Charlop, & 
Schreibman (1988). 

| Outcome measure | Standardized treatment effect estimate (standard error), Model (1) | Standardized treatment effect estimate (standard error), Model (3) | Qshock | p-value for $\beta_1 = \beta_2 = \cdots = \beta_N$ |
|---|---|---|---|---|
| Child vocalizations - free play setting | 1.47 (0.12) | 1.46 (0.25) | 0.0 | 0.909 |
| Parent verbalizations - free play setting | 1.76 (0.11) | 1.60 (0.24) | 0.0 | 0.994 |
| Child vocalizations - break room setting | 1.35 (0.16) | 0.96 (0.37) | 0.0 | 0.676 |
| Parent verbalizations - break room setting | 1.57 (0.22) | 0.42 (0.40) | 19.5 | 0.003 |